World Development Indicators¶

This notebook performs an exploratory data analysis of three world development indicators: Access to Electricity, Hospital Beds, and Gross Domestic Product. The data was retrieved from DataBank: World Development Indicators.


In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sbn
import plotly.graph_objects as go
import plotly.express as px

1. Import Data¶

The results produced by the preprocessing notebook are loaded back into this notebook here.

In [2]:
gdp_df = pd.read_csv("data/gdp_one_df.csv", low_memory=False)
hbd_df = pd.read_csv("data/hbd_one_df.csv", low_memory=False)
ate_df = pd.read_csv("data/ate_one_df.csv", low_memory=False)
In [3]:
# Combine the three indicator dataframes into a single dataframe.
# Note: this column assignment assumes all three CSVs share the same row order.
all_df = gdp_df.copy(deep=True)
all_df["HBD"] = hbd_df["HBD"]
all_df["ATE"] = ate_df["ATE"]
all_df.tail(2)
Out[3]:
CountryName CountryCode Year GDP HBD ATE
1138 Vanuatu VUT 2021 9.834693e+08 0.0 0.0
1139 Vietnam VNM 2021 3.626375e+11 0.0 0.0
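The column assignments above rely on the three CSVs sharing the same row order. A more defensive way to combine them is an explicit merge on the key columns; a minimal sketch on hypothetical mini-frames, assuming `CountryCode` and `Year` uniquely identify a row:

```python
import pandas as pd

# Hypothetical mini-frames standing in for gdp_df and hbd_df
gdp_df = pd.DataFrame({"CountryCode": ["VUT", "VNM"], "Year": [2021, 2021],
                       "GDP": [9.83e8, 3.63e11]})
hbd_df = pd.DataFrame({"CountryCode": ["VNM", "VUT"], "Year": [2021, 2021],
                       "HBD": [2.6, 1.7]})  # note: different row order

# Merging on the key columns aligns rows regardless of their order
all_df = gdp_df.merge(hbd_df, on=["CountryCode", "Year"], how="left")
print(all_df.loc[all_df["CountryCode"] == "VUT", "HBD"].iloc[0])  # 1.7
```

With a left merge, any country-year missing from `hbd_df` would simply come through as NaN instead of silently pairing the wrong rows.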
In [4]:
# Export merged data for backup
all_df.to_csv("data/all_df.csv",  index=False)
In [5]:
# Create a list of all countries
list_countries = all_df['CountryName'].unique().tolist()
In [6]:
# Create a list of all the years in the data set
list_years = all_df['Year'].unique().tolist()

2. Quick Visualization¶

Let's take a quick look at our data using these interactive plots!

In [7]:
pal = list(sbn.color_palette(palette='viridis', n_colors=len(list_countries)).as_hex())

fig = go.Figure()
# GDP over time, one line per country
for country, p in zip(list_countries, pal):
    fig.add_trace(go.Scatter(x=list_years, y=all_df[all_df['CountryName']==country]['GDP'],
                             name=country, line_color=p, fill=None))
fig.show()
In [8]:
pal = list(sbn.color_palette(palette='viridis', n_colors=len(list_countries)).as_hex())

fig = go.Figure()
# Hospital beds over time, one line per country
for country, p in zip(list_countries, pal):
    fig.add_trace(go.Scatter(x=list_years, y=all_df[all_df['CountryName']==country]['HBD'],
                             name=country, line_color=p, fill=None))
fig.show()
In [9]:
pal = list(sbn.color_palette(palette='viridis', n_colors=len(list_countries)).as_hex())

fig = go.Figure()
# Access to electricity over time, one line per country
for country, p in zip(list_countries, pal):
    fig.add_trace(go.Scatter(x=list_years, y=all_df[all_df['CountryName']==country]['ATE'],
                             name=country, line_color=p, fill=None))
fig.show()

3. Describe our Data¶

Describing the data gives us an idea of how it is distributed; the next few outputs will help us understand it.

In [10]:
all_df.describe()
Out[10]:
Year GDP HBD ATE
count 1140.000000 1.140000e+03 1140.00000 1140.000000
mean 2015.500000 5.044162e+11 2.96679 87.553240
std 3.453568 1.431225e+12 2.60405 29.149339
min 2010.000000 3.210420e+07 0.00000 0.000000
25% 2012.750000 7.447009e+09 0.82250 97.897501
50% 2015.500000 5.621421e+10 2.67000 100.000000
75% 2018.250000 4.078550e+11 3.86750 100.000000
max 2021.000000 1.773406e+13 16.46000 100.000000
In [11]:
all_df.dtypes
Out[11]:
CountryName     object
CountryCode     object
Year             int64
GDP            float64
HBD            float64
ATE            float64
dtype: object
In [12]:
# Checking the distribution of our data using histograms
all_df.hist(bins=5, figsize=(20, 10));
In [13]:
# Checking for correlation between our columns
fig = px.imshow(all_df.corr())
fig.show()
In [14]:
# Another way of checking relationships between our columns
sbn.pairplot(all_df)
plt.show()
In [15]:
# Checking column relationships per country
# This may be a little hard to interpret with every country overlaid;
# we can re-implement this per country for a clearer analysis.
sbn.pairplot(all_df, hue='CountryName')
plt.show()

4. Check for Outliers¶

There are a few outliers in our data, but removing them may do more harm than good: this is a time series dataset with a yearly grain, so the data is already very limited.

In [16]:
all_df.plot(kind='box', subplots=True, layout=(2,7),sharex=False,sharey=False, figsize=(20, 10), color='darkorange');
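For reference, a standard way to flag (rather than drop) outliers is the 1.5×IQR rule that box plots themselves use; a minimal sketch on hypothetical HBD-like values:

```python
import pandas as pd

# Hypothetical hospital-bed values with one obvious outlier
s = pd.Series([0.9, 2.7, 3.9, 2.5, 3.1, 16.5])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [16.5]
```

Flagging like this lets us inspect suspicious country-year observations without shrinking an already small dataset.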

5. Create a prediction model for GDP¶

In the Excel sheet we tried to create a per-country GDP forecast; let's try to do that quickly with all of our data and see whether we can get an accurate model for predicting GDP.
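Before fitting anything heavier, it is worth noting what the Excel-style baseline amounts to: a linear trend fitted per country and extrapolated one year ahead. A minimal sketch on hypothetical GDP values (using `np.polyfit` on Year):

```python
import numpy as np

# Hypothetical yearly GDP values for one country (an exact linear trend, for clarity)
years = np.array([2018, 2019, 2020, 2021])
gdp = np.array([1.0e10, 1.1e10, 1.2e10, 1.3e10])

# Fit a degree-1 polynomial (linear trend) and extrapolate one year ahead
slope, intercept = np.polyfit(years, gdp, deg=1)
forecast_2022 = slope * 2022 + intercept
print(round(forecast_2022 / 1e10, 2))  # 1.4
```

Any model we train below should at least beat this kind of naive per-country extrapolation to be worth keeping.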

In [17]:
# Let's start by creating a copy of our dataset
pred_df = all_df.copy(deep=True)
In [18]:
pred_df.drop(columns=["CountryCode"], inplace=True)
pred_df.tail(5).T
Out[18]:
1135 1136 1137 1138 1139
CountryName Ukraine United Kingdom Uzbekistan Vanuatu Vietnam
Year 2021 2021 2021 2021 2021
GDP 200085537744.354004 3186859739185.02002 69238903106.173798 983469256.849629 362637524070.968994
HBD 0.0 0.0 0.0 0.0 0.0
ATE 0.0 0.0 0.0 0.0 0.0
In [19]:
pred_df.CountryName.value_counts()
Out[19]:
Albania             12
Netherlands         12
Poland              12
Philippines         12
Papua New Guinea    12
                    ..
Greenland           12
Greece              12
Gibraltar           12
Germany             12
Vietnam             12
Name: CountryName, Length: 95, dtype: int64
In [20]:
# Turn categorical variables into numeric category codes
for label, content in pred_df.items():
    if pd.api.types.is_string_dtype(content):
        # Replace each string column with its category codes
        pred_df[label] = pd.Categorical(content).codes
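A quick illustration of what `pd.Categorical(...).codes` produces on a toy column: codes follow the sorted order of the unique values, and missing values are encoded as -1.

```python
import pandas as pd

# Categories are ordered alphabetically: Albania -> 0, Vietnam -> 1;
# missing values become -1
codes = pd.Categorical(["Vietnam", "Albania", "Vietnam", None]).codes
print(codes.tolist())  # [1, 0, 1, -1]
```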
In [21]:
# Let us create a sample to test if we can fit our data into a model
X_train, y_train = pred_df.drop("GDP", axis=1), pred_df.GDP
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1, random_state=42, n_estimators = 100, max_samples=100)
model.fit(X_train, y_train)
Out[21]:
RandomForestRegressor(max_samples=100, n_jobs=-1, random_state=42)
In [22]:
# Create training and validation datasets (hold out the latest year for validation)
df_val = pred_df[pred_df.Year == 2021]
df_train = pred_df[pred_df.Year != 2021]

len(df_val), len(df_train)
Out[22]:
(95, 1045)
In [23]:
# Split data into X & y
X_train, y_train = df_train.drop("GDP", axis=1), df_train.GDP
X_valid, y_valid = df_val.drop("GDP", axis=1), df_val.GDP

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape
Out[23]:
((1045, 4), (1045,), (95, 4), (95,))
In [24]:
# Create function to evaluate our model
from sklearn.metrics import mean_squared_error, mean_absolute_error

model_accuracies = []

def show_scores(estimator, model):
    """
    Displays the MAE, RMSE and R-squared of both training and validation.
    This function also appends each result to the model_accuracies list for better comparison.
    """
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Estimator": estimator,
              "Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSE": np.sqrt(mean_squared_error(y_train, train_preds)),
              "Valid RMSE": np.sqrt(mean_squared_error(y_valid, val_preds)),
              "Training R^2": model.score(X_train, y_train),
              "Valid R^2": model.score(X_valid, y_valid)}
    model_accuracies.append(list(scores.values()))
    return scores
In [25]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Create a dictionary containing multiple models/estimators
dict_regressors = {
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
    "ElasticNet": ElasticNet(random_state=42),
    "LinearBayesianRidge": linear_model.BayesianRidge(),
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_jobs=-1, random_state=42, max_samples=1000)} 

# Re-initialize the model_accuracies list
model_accuracies = []
In [26]:
%%time
# Loop through the regression model dictionary, fit to our training data and score the models.
for model_name, model_instance in dict_regressors.items():
    model_instance.fit(X_train, y_train)
    show_scores(model_name, model_instance)
CPU times: total: 1.91 s
Wall time: 925 ms
In [27]:
# Convert the model_accuracies list into a DataFrame for better readability
model_accuracies_df = pd.DataFrame(model_accuracies,
                                   columns=['Estimator', 'Training MAE', 'Valid MAE',
                                            'Training RMSE', 'Valid RMSE',
                                            'Training R^2', 'Valid R^2'])
model_accuracies_df
Out[27]:
Estimator Training MAE Valid MAE Training RMSE Valid RMSE Training R^2 Valid R^2
0 GradientBoostingRegressor 1.657705e+11 3.620254e+11 2.653431e+11 7.644505e+11 0.962662 0.845397
1 ElasticNet 6.225162e+11 8.745296e+11 1.335221e+12 2.122869e+12 0.054543 -0.192245
2 LinearBayesianRidge 6.313179e+11 6.573992e+11 1.373195e+12 1.954571e+12 0.000000 -0.010699
3 LinearRegression 6.231085e+11 8.332307e+11 1.334854e+12 2.105567e+12 0.055063 -0.172889
4 RandomForestRegressor 4.777562e+10 4.537450e+11 1.336596e+11 1.018737e+12 0.990526 0.725436
In [28]:
model_accuracies_df.plot(title='Mean Absolute Error', x='Estimator', y=['Training MAE', 'Valid MAE'], figsize=(10,5), grid=True, kind='bar');

Our errors are high, but let's take the least erroneous models and dissect how they analyzed our data by checking the predictor importances below.

In [29]:
dict_regressors["GradientBoostingRegressor"]
Out[29]:
GradientBoostingRegressor(random_state=42)
In [30]:
# Get the model feature importance for the GradientBoostingRegressor
sbn.set(rc={"figure.figsize":(15,4)}) 
sbn.barplot(x= dict_regressors["GradientBoostingRegressor"].feature_importances_, y=X_train.columns);
In [31]:
# Get the model feature importance for the RandomForestRegressor
sbn.set(rc={"figure.figsize":(15,4)}) 
sbn.barplot(x= dict_regressors["RandomForestRegressor"].feature_importances_, y=X_train.columns);

All the steps above should have given us some familiarity with how our data behaves.